getwd()
## [1] "C:/Users/Nirmal/Documents/Python Scripts"
setwd("C:/Users/Nirmal/Documents/Python Scripts")
getwd()
## [1] "C:/Users/Nirmal/Documents/Python Scripts"
data=read.csv("HousingData.csv")
sum(is.na(data))
## [1] 120
is.na(data)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [37,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [43,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [46,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [47,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [48,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [50,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [51,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [52,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [53,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [54,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [55,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [57,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [58,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [59,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [60,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [62,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [63,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [64,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [65,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [66,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [68,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [69,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [70,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [71,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [ reached getOption("max.print") -- omitted 435 rows ]
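# The full is.na() matrix above is hard to scan. A minimal sketch of a per-column
# summary follows, using a small made-up data frame (`toy` and its values are
# hypothetical, not the housing data).

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 2, 5),
  c = c(1, 2, 3, 4)
)

# Count the NAs in each column, then keep only the columns that have any
na_per_column <- colSums(is.na(toy))
na_per_column <- na_per_column[na_per_column > 0]
print(na_per_column)
```

# Applied to the housing data, the same call would be colSums(is.na(data)).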
str(data)
## 'data.frame': 506 obs. of 14 variables:
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 NA 0 0 NA ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ B : num 397 397 393 395 397 ...
## $ LSTAT : num 4.98 9.14 4.03 2.94 NA ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Mean-impute the continuous columns that contain missing values
data$CRIM[is.na(data$CRIM)]= mean(data$CRIM, na.rm = TRUE)
data$AGE[is.na(data$AGE)]= mean(data$AGE, na.rm= TRUE)
data$INDUS[is.na(data$INDUS)]= mean(data$INDUS, na.rm= TRUE)
data$LSTAT[is.na(data$LSTAT)]= mean(data$LSTAT, na.rm= TRUE)
# CHAS is a 0/1 indicator, so its missing values are set to 0
data$CHAS[is.na(data$CHAS)]= 0
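# The column-by-column imputation can also be written once and applied to several
# columns. This is only a sketch on a made-up data frame; `toy`, `impute_mean`
# and `num_cols` are hypothetical names, not part of the original analysis.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  x = c(1, NA, 3),
  y = c(10, 20, NA),
  flag = c(0L, NA, 1L)
)

# Replace the NAs in a numeric vector with the mean of its observed values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Apply mean imputation to the chosen continuous columns
num_cols <- c("x", "y")
toy[num_cols] <- lapply(toy[num_cols], impute_mean)

# The binary indicator column gets a fixed fill value instead of a mean
toy$flag[is.na(toy$flag)] <- 0L
```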
# ZN has not been imputed and still contains NAs, so its correlations print as NA
cor(data)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
## CRIM 1.00000000 NA 0.39116137 -0.053710495 0.41037672 -0.2154338 0.34493361 -0.36652274 0.608886320 0.56652782
## ZN NA 1 NA NA NA NA NA NA NA NA
## INDUS 0.39116137 NA 1.00000000 0.054172460 0.74096466 -0.3814574 0.61459225 -0.69963912 0.593176456 0.71606232
## CHAS -0.05371049 NA 0.05417246 1.000000000 0.07086746 0.1067974 0.07354903 -0.09231841 -0.003339387 -0.03582225
## NOX 0.41037672 NA 0.74096466 0.070867463 1.00000000 -0.3021882 0.71146138 -0.76923011 0.611440563 0.66802320
## RM -0.21543377 NA -0.38145737 0.106797424 -0.30218819 1.0000000 -0.24135070 0.20524621 -0.209846668 -0.29204783
## AGE 0.34493361 NA 0.61459225 0.073549029 0.71146138 -0.2413507 1.00000000 -0.72435308 0.449988663 0.50058938
## DIS -0.36652274 NA -0.69963912 -0.092318410 -0.76923011 0.2052462 -0.72435308 1.00000000 -0.494587930 -0.53443158
## RAD 0.60888632 NA 0.59317646 -0.003339387 0.61144056 -0.2098467 0.44998866 -0.49458793 1.000000000 0.91022819
## TAX 0.56652782 NA 0.71606232 -0.035822250 0.66802320 -0.2920478 0.50058938 -0.53443158 0.910228189 1.00000000
## PTRATIO 0.27338389 NA 0.38480592 -0.109451496 0.18893268 -0.3555015 0.26272340 -0.23247054 0.464741179 0.46085304
## B -0.37016342 NA -0.35459662 0.050607567 -0.38005064 0.1280686 -0.26528227 0.29151167 -0.444412816 -0.44180801
## LSTAT 0.43404449 NA 0.56735384 -0.047807594 0.57237922 -0.6029620 0.57489289 -0.48342926 0.468439666 0.52454474
## MEDV -0.37969547 NA -0.47865733 0.183844439 -0.42732077 0.6953599 -0.38022344 0.24992873 -0.381626231 -0.46853593
## PTRATIO B LSTAT MEDV
## CRIM 0.2733839 -0.37016342 0.43404449 -0.3796955
## ZN NA NA NA NA
## INDUS 0.3848059 -0.35459662 0.56735384 -0.4786573
## CHAS -0.1094515 0.05060757 -0.04780759 0.1838444
## NOX 0.1889327 -0.38005064 0.57237922 -0.4273208
## RM -0.3555015 0.12806864 -0.60296205 0.6953599
## AGE 0.2627234 -0.26528227 0.57489289 -0.3802234
## DIS -0.2324705 0.29151167 -0.48342926 0.2499287
## RAD 0.4647412 -0.44441282 0.46843967 -0.3816262
## TAX 0.4608530 -0.44180801 0.52454474 -0.4685359
## PTRATIO 1.0000000 -0.17738330 0.37334313 -0.5077867
## B -0.1773833 1.00000000 -0.36888621 0.3334608
## LSTAT 0.3733431 -0.36888621 1.00000000 -0.7219746
## MEDV -0.5077867 0.33346082 -0.72197464 1.0000000
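# The NA row and column for ZN come from cor() propagating missing values by
# default. A minimal sketch of the `use` argument on a made-up data frame
# (`toy` and its values are hypothetical):

```r
# Toy data: w has one missing value (hypothetical values)
toy <- data.frame(
  u = c(1, 2, 3, 4, 5),
  v = c(2, 4, 6, 8, 10),
  w = c(5, NA, 3, 2, 1)
)

# Default behaviour: any pair involving a column with an NA yields NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses only the rows where both values are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

# Pairwise deletion means different cells can rest on different rows, so imputing
# or dropping the column first, as done in this analysis, is also reasonable.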
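# CHAS and RAD enter as numeric predictors here (1 Df each), so each term tests a
# linear trend rather than a difference between group means.
# The result is named anova_fit to avoid masking the base function anova().
anova_fit=aov(MEDV~(CHAS+RAD), data=data)
summary(anova_fit)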
## Df Sum Sq Mean Sq F value Pr(>F)
## CHAS 1 1444 1444 20.71 6.72e-06 ***
## RAD 1 6201 6201 88.94 < 2e-16 ***
## Residuals 503 35071 70
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
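# Because each term above has 1 Df, the predictors were treated as numeric.
# A sketch of how wrapping a coded variable in factor() changes the test,
# using the built-in mtcars data as a stand-in (not the housing data):

```r
# cyl takes the values 4, 6 and 8 in mtcars
fit_numeric <- aov(mpg ~ cyl, data = mtcars)          # one slope: 1 Df
fit_factor  <- aov(mpg ~ factor(cyl), data = mtcars)  # three groups: 2 Df

# Degrees of freedom for the term and the residuals in each fit
df_numeric <- summary(fit_numeric)[[1]]$Df
df_factor  <- summary(fit_factor)[[1]]$Df

print(df_numeric)
print(df_factor)
```

# Whether RAD should be treated as a factor here is a modelling choice; this
# sketch only shows the mechanics.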
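# Drop ZN, CHAS, DIS and B (columns 2, 4, 8 and 12) by name rather than by position
data=data[, !(names(data) %in% c("ZN", "CHAS", "DIS", "B"))]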
sum(is.na(data))
## [1] 0
library(plotly)
fig1= plot_ly(data, x = ~CRIM, y = ~MEDV,
type= 'scatter', mode="markers"
)
fig1= fig1 %>% layout(title= "Crime Rate vs Median Value")
fig1
fig3= plot_ly(data, x = ~RM, y = ~MEDV,
type="scatter", mode="markers"
)%>%layout(title="RM vs Median Value")
fig3
fig5= plot_ly(data, x = ~LSTAT, y = ~MEDV,
type="scatter", mode="markers"
)%>%layout(title="LSTAT vs Median Value")
fig5
fig6= plot_ly(data, x = ~PTRATIO, y = ~MEDV,
type="scatter", mode="markers"
)%>%layout(title="PTRATIO vs Median Value")
fig6
fig7= plot_ly(data, x = ~TAX, y = ~MEDV,
type="scatter", mode="markers"
)%>%layout(title="TAX vs Median Value")
fig7
fig8= plot_ly(data, x = ~INDUS, y = ~MEDV,
type="scatter", mode="markers"
)%>%layout(title="INDUS vs Median Value")
fig8
fig9= plot_ly(data, x = ~NOX, y = ~MEDV,
type="scatter", mode="markers"
)%>%layout(title="NOX vs Median Value")
fig9
hist(data$CRIM, col="red", main="Histogram of Crime Rate in Boston",
xlab="Crime Rate", ylab="Frequency")

hist(data$TAX, col="red", main="Histogram of TAX rate in Boston",
xlab="TAX", ylab="Frequency")

hist(data$RM, col="blue", main="Histogram of Rooms Per House in Boston",
xlab="Rooms per house", ylab="Frequency")

hist(data$LSTAT, col="black", main="Histogram of Socioeconomic Status in Boston",
xlab="LSTAT", ylab="Frequency")

hist(data$PTRATIO, col="green", main="Histogram of PTRATIO in Boston",
xlab="PTRATIO", ylab="Frequency")

hist(data$INDUS, col="green", main="Histogram of INDUS in Boston",
xlab="INDUS", ylab="Frequency")

hist(data$NOX, col="orange", main="Histogram of Nitric Oxide Concentration In Boston",
xlab="NOX", ylab="Frequency")

hist(data$MEDV, col="pink", main="Histogram of Home Prices in Boston",
xlab="MEDV", ylab="Frequency")

library(xgboost)
library(gbm)
library(caret)
library(caTools)
library(dplyr)
y= data$MEDV
x= data%>%select(-MEDV)
# set.seed is not an xgboost parameter; seed R's random number generator instead
set.seed(1502)
params= list(eval_metric="rmse", objective="reg:squarederror")
model_xgboost= xgboost(data=as.matrix(x), label=y, params=params, nrounds = 2, verbose=2)
## [1] train-rmse:17.072899
## [2] train-rmse:12.303709
x$PMEDV= predict(model_xgboost, data.matrix(x))
cor(x$PMEDV,y)
## [1] 0.9392105
set.seed(123)
sample= sample.split(data$MEDV, SplitRatio = 0.70)
trainset= subset(data, sample==TRUE)
testset= subset(data, sample==FALSE)
model_gbm= gbm(MEDV~., data= trainset, distribution="gaussian", cv.folds= 20, shrinkage= 0.01,
n.minobsinnode=10, n.trees= 500)
testset$PMEDV= predict.gbm(model_gbm, testset)
## Using 500 trees...
cor(testset$PMEDV,testset$MEDV)
## [1] 0.766734
model_lm=lm(MEDV~., data=trainset)
testset$P2MEDV= predict(model_lm, testset)
cor(testset$P2MEDV, testset$MEDV)
## [1] 0.6987159
library(MLmetrics)
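# Note: the XGBoost MSE is computed on the rows the model was trained on,
# whereas the GBM and linear-model MSEs below use the held-out test set.
MSE(y_pred= x$PMEDV, y_true= y)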
## [1] 151.3812
MSE(y_pred= testset$PMEDV, y_true= testset$MEDV)
## [1] 28.634
MSE(y_pred= testset$P2MEDV, y_true= testset$MEDV)
## [1] 38.22198
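# A minimal sketch of scoring one model on both its training rows and a held-out
# split, so the two kinds of error can be told apart. It uses the built-in mtcars
# data and a hand-written `mse` helper (both are stand-ins, not the original
# analysis).

```r
set.seed(123)

# 70/30 split by row index
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training rows only
fit <- lm(mpg ~ wt + hp, data = train)

# Mean squared error between actual and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

train_mse <- mse(train$mpg, predict(fit, train))  # error on rows seen in fitting
test_mse  <- mse(test$mpg, predict(fit, test))    # error on unseen rows

print(c(train = train_mse, test = test_mse))
```

# Models are only comparable when their errors come from the same held-out rows.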
# From the scatter plots and histograms above, the following conclusions can be drawn:
# 1. Towns with lower crime rates tend to have higher property values. Crime rates in Boston are mostly low: the histogram is positively skewed, meaning most towns have low crime rates and only a few have high ones.
# 2. The number of rooms per house is another driver of home prices in Boston: houses with more rooms tend to have higher values, and houses with fewer rooms lower ones (the correlation between RM and MEDV is about 0.70). Rooms per dwelling is approximately normally distributed.
# 3. LSTAT (the percentage of lower-status residents) is positively skewed: most towns have a low to moderate share of lower-status residents, and only a few towns have a high share. LSTAT has the strongest correlation with MEDV in the data (about -0.72), so home values fall as the share of lower-status residents rises.
# 4. The pupil-teacher ratio (PTRATIO) also affects housing prices in Boston. Buyers are drawn to towns whose schools have smaller classes, since a smaller class generally means more teacher attention per student and better educational resources. House prices therefore tend to be somewhat higher in towns with a lower PTRATIO (the correlation with MEDV is about -0.51).
# 5. Higher tax rates are associated with lower property values, as the scatter plot shows. A few towns with low tax rates show high house prices. One plausible explanation is that homeowners value a property more when they pay less tax on it.
# 6. People prefer to live in towns with retail businesses rather than non-retail businesses, because they do not have to travel far for clothes, groceries, and other shopping. In the scatter plot, towns with a higher INDUS (more non-retail business) have lower home prices, which suggests that buyers avoid them.
# 7. Finally, nitric oxide concentration (NOX) is also useful in assessing home prices. Higher NOX signals more air pollution, which buyers tend to avoid, so home prices are highest in towns where NOX is very low. NOX is positively skewed, which shows that most towns in Boston have low NOX concentrations and are therefore less affected by air pollution.
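# 8. To predict the median home value in Boston, we used three algorithms: XGBoost, gradient boosting (GBM), and linear regression, and compared their errors. On the held-out test set, GBM (MSE 28.63) outperformed linear regression (MSE 38.22). The XGBoost model, trained for only 2 rounds and evaluated on its own training data, had an MSE of 151.38, so it is not directly comparable with the other two.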
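# Several conclusions above describe a variable as positively skewed based on its
# histogram. A minimal sketch of checking this numerically with the moment-based
# skewness coefficient (the `skewness` helper and the sample vectors are
# hypothetical and written here for illustration):

```r
# Moment-based skewness: third central moment over the 1.5 power of the variance
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

right_tailed <- c(1, 1, 1, 2, 2, 3, 10)  # long right tail (made-up values)
symmetric    <- c(1, 2, 3, 4, 5)         # symmetric around 3

print(skewness(right_tailed))  # positive for a right-tailed sample
print(skewness(symmetric))     # zero for a symmetric sample
```

# On the housing data, a call such as skewness(data$CRIM) would give a number to
# set beside the histogram.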